A New Monte Carlo Algorithm for Protein Folding 
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We demonstrate that the recently introduced pruned-enriched Rosenbluth method (PERM) 
leads to extremely efficient algorithms for the folding of simple model proteins. We test them on 
several models for lattice heteropolymers, and compare to published Monte Carlo studies. In all 
cases our algorithms are faster than previous ones, and in several cases we find new minimal energy 
states. In addition, our algorithms give estimates for the partition sum at finite temperatures. 
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Protein folding [|l| is one of the outstanding problems 
in mathematical biology. It is concerned with the prob- 
lem of how a given sequence of amino acids assumes pre- 
cisely that geometrical shape which is biologically useful. 
Currently it is much easier to find coding DNA (and, 
thus, amino acid) sequences than to find the correspond- 
ing structures. Thus, solving the folding problem would 
be a major break-through in understanding the biochem- 
istry of the cell, and in designing artificial proteins. 

In this Letter we are concerned only with the most 
straightforward direct approach: given a sequence, a 
molecular potential and no other information, find the 
ground state and the equilibrium state at physiological 
temperatures. Note that we are not concerned with the 
kinetics of folding, but only in the final outcome. Also, 
we will not address the problems of how to find good 
molecular potentials, and what is the proper level of de- 
tail in describing proteins. Instead, we will use simple 
coarse-grained models which have been proposed in the 
literature and have become standards in testing the effi- 
ciency of folding algorithms. 

The models we study are heteropolymers which live on 
3-d or 2-d regular lattices. They are self-avoiding chains 
with attractive or repulsive interactions between neigh- 
boring non-bonded monomers. These interactions can 
have continuous distributions ||], but the majority of au- 
thors considered only two kinds of monomers. In the 
HP model they are hydrophobic (H) and polar (P), 
with (e//H, e_ffp, epp) = —(1,0,0). Since this leads to 
highly degenerate ground states, alternative models were 
proposed, e.g. e = -(3,1,3) § and e = -(1,0,1) §. 

The algorithms we apply here are variants of the 
pruned-enriched Rosenbluth method (PERM) , a chain 
growth algorithm based on the Rosenbluth-Rosenbluth 
(RR) [|j method. Monomers are added sequentially, the 
n-th monomer being placed at site i with probability 
Pn(i)- In simple sampling, Pn(i) is uniform on all neigh- 
bors of the last monomer, leading to exponential attri- 
tion. The original RR method avoids this by using a 
uniform Pnii) on all vacant neighbors of in-i- More gen- 
erally, we call any non-uniform choice of Pn(i) a gen- 



eralized RR method. The relative thermal weight of a 
particular chain conformation of length n is then deter- 
mined by Wn = m„ exp(— /3Ai?„)IF„_i, with Wi = 1; 
AEn is the energy gain from adding monomer n; and 
m„ is the Rosenbluth factor, m„ = Z]je{nn} Pnij) / Pn{i) ■ 
We note that Wn is also an estimate for the partition 
function Z„ of the n- monomer chain 0|. Chain growth 
is stopped when the final size is reached and started new 
from n = 1. 

In easy cases, Pn{i) can be chosen so that Boltzmann 
and Rosenbluth factors - or Rosenbluth factors for dif- 
ferent n - cancel, leading to narrow weight distributions. 
But in general, this algorithm produces a wide spread in 
weights that can lead to serious problems On the 
other hand, since the weights accumulate as the chain 
grows, one can interfere during the growth process by 
'pruning' conformations with low weights and enriching 
high- weight conformations. This is, in principle, simi- 
lar to population based methods in polymer simulations 
0,0 and in quantum Monte Carlo (MC) ||l||. However, 
our implementation is different. Pruning is done stochas- 
tically: if the weight of a conformation has decreased 
below a threshold W^, it is eliminated with probability 
1/2, while it is kept and its weight is doubled in the other 
half of cases. Enrichment jl^ is done independently of 
this: if Wn increases above another threshold , the 
conformation is replaced by Hc copies, each with weight 
Wn/nc- Technically, this is done by putting onto a stack 
all information about conformations which still have to 
be copied. This is most easily implemented by recursive 
function calls |Q . Thereby the need for keeping large pop- 
ulations of conformations pO|-p^ is avoided. PERM has 
proven extremely efficient for studies of lattice homopoly- 
mers near the 6 point ||^, their phase equilibria and 
of the ordering transition in semi-stiff polymers |15| . 

The main freedom when applying PERM consists in 
the a priori choice of the sites where to place the next 
monomer, i.e. the probabilities Pn{j), in the thresholds 
W^ and W^ for pruning and enrichment, and in the 
number of copies, Uc, made upon enrichment. All these 
features do not effect the correctness of the algorithm. 
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but they can greatly influence its efliciency. They may 
depend arbitrarily on chain lengths and on local confor- 
mations, and they can be changed freely at any time 
during the simulation. Thus, the algorithm can 'learn' 
during the simulation |Q| . 

In order to apply PERM to heteropolymers at very 
low temperatures, the strategies proposed in Q are mod- 
ified as follows. 

(1) For homopolymers near the theta-point it had 
been found that the best choice for the placement of 
monomers was not according to their Boltzmann weights, 
but uniformly on all allowed sites like in the orig- 
inal RR. This is due to cancellation between Boltzmann 
and RR factors: larger Boltzmann factors correspond to 
higher densities and, thus, to smaller RR factors 0. 

For a heteropolymer this has to be modified, as there 
is no longer a unique relationship between density and 
Boltzmann factor. In a strategy of 'anticipated impor- 
tance sampling' we should preferentially place monomers 
in sites with mostly attractive neighbors. Assume that 
we have two kinds of monomers and we want to place a 
type-j4 monomer. If an allowed site has ms neighbors of 
type B {B = H,P), we select this site with probability 
oc 1 -I- aAHiTiH + dApmp. Here, uab are constants with 
aAB > for CAB < and vice versa. 

(2) Most naturally, the and are chosen pro- 
portional to the estimated partition sum Z„ (i.e. the 
average of the Wn already generated) : e.g. — cZ„, 
c « 0.5, and W> = rW,f , r « 10 0. But this becomes 
inefficient at very low T since Z„ will be underestimated 
as long as no low-energy state is found. But when this 
finally happens, is too small and, thus, too many 
(correlated) copies are produced. This costs CPU time 
but does not increase the quality of sampling. 

This problem could be avoided by increasing 
and during particularly successful 'tours' (a tour 
is the set of conformations derived from a single start 
0). But then also the average number of long chains 
is decreased in comparison with short ones. To re- 
duce this effect and to create a bias towards a sam- 
ple which is flat in chain length, we multiply by some 



power of Mn/Mi, where M„ is the number of gener- 
ated chains of length n. With Af{n) denoting the num- 
ber of chains generated during the current tour we used 
W< Zn [il+Af{n)/M){M^ + M)/{Mi + M)]\ with 
C a constant of order unity and Ai a constant of order 
10"* - 10^. 

(3) For the number of copies Uc created when Wn 
surpasses W^, a good choice is int[l -I- \/Wn/W>]. 

(4) In some cases we did not start to grow the chain 
from one end but from a point in the middle. We grew 
first into one direction, and then into the other. Results 
were averaged over all possible starting points. The idea 
behind this is that real proteins have folding nuclei, and 
it should be most efficient to start from such a nucleus. In 
some cases this approach was very successful and speeded 
up the ground state search substantially, in others not. 

(5) Special tricks were employed for 'compact' config- 
urations filling a square or a cube |p^ . 

Let us now discuss our results. Items (a) to (c) con- 
cern the original HP model [|| with e = —(1,0,0). 

(a) Ten sequences of length = 48 were given in . 
Each was designed by minimizing the energy of a particu- 
lar target conformation under the constraint of constant 
composition. The authors tried to find lowest energy 
states with a heuristic MC method ||l^, and an exact 
enumeration of low energy states (which cannot be 
generalized to other models). The MC method failed in 
all but one case. Precise CPU times were not quoted. 
With PERM we succeeded to reach ground states in all 
cases, average CPU time per sequence ranging from few 
seconds to several hours (all times refer to a SUN UL- 
TRA SPARC, 167 MHz). We verified also that these 
ground states are highly degenerate, and that there are 
no gaps between ground and first excited states. Thus, 
none of these sequences are good folders, though they 
were designed specifically for this purpose. 

(b) In two versions of a genetic algorithm were 
used to simulate 2-d HP chains of lengths 20 to 64, and 
compared to other MC algorithms. Ground state ener- 
gies were supposed to be known since the chains had been 
specially designed. In all cases we reached the ground 



TABLE I. Newly found lowest energy states for binary sequences with interactions e* — [euH, ^hp, epp). Configurations are 
encoded as sequences of r(ight),/(eft), tt(p), rf(own), /(orward), and 6(ackward). 
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Ref. 
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-(1,0,0) 
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state energies proposed by the authors in less than Ih 
CPU time, except for the sequence of length 64. For 
that sequence we obtained E = —39, while none of the 
algorithms used in Q reached energies below —37. For 
the chain with length 60, we found several states with 
E — —36 although the authors had claimed E > —34 by 
construction (see Table I). 

(c) Two 2-d HP chains with N = 100 were studied in 
pl[ . The authors claimed that the native conformations 
are compact, fit exactly into a 10 x 10 square, and have 
energies —44 and —46. These energies were found by a 
specially designed MC algorithm which should be par- 
ticularly efRcient for compact conformations. Wc found 
non-compact (degenerate) conformations with energies 
—47 and —49 (see Table I), while our lowest energy com- 
pact states (also degenerate) have E ~ —46 and —47 

(d) Sequences with N — 27 and with continuous in- 
teractions were studied in |^ . Interaction strengths were 
sampled from Gaussians and were permuted to obtain 
good folders. In all cases we could reach the supposed 
ground state energies, within less than Ih in the worst 
case. This time the design had been successful, and all 
sequences showed gaps between the ground state and the 
bulk of low-lying states. These gaps were in some cases 
filled by conformations which were very similar to the 
ground state, so that they could not prevent these se- 
quences to be counted as good folders. In no case we 
found energies lower than those quoted in ||]. 

(e) Short sequences with which we had neither prob- 
lems nor surprises were given in several papers: 48-mers 
in 27-mers in ||2^, and sequences with TV < 36 in [ p^ . 

(f) The most interesting case is a 2-species 80-mer 
with interactions —(1,0,1) studied first in These 
particular interactions were chosen because it was hoped 
that they would lead to compact conformations. Indeed, 
the sequence was specially designed to form a "four he- 
lix bundle" which fits perfectly into a 4 x 4 x 5 box (see 
Fig. 1). Its energy in this putative native state is —94. 
Although the authors of @ used highly optimized codes, 
they were not able to recover this state by MC. Instead, 
they reached only E = —91. Supposedly, a different state 
with E = -94 was found in but Fig. 10 of this pa- 
per, which is claimed to show this conformation, has a 
much higher value of E. 

Even without much tuning our algorithm gave E = 
—94 after a few hours, but it did not stop there. After 
a number of conformations with successively lower ener- 
gies, the final candidate for the native state has E — —98. 
It again has a highly symmetric shape, although it does 
not fit into a 4 x 4 x 5 box (see Fig. 2). It has twofold de- 
generacy (the central 2x2x2 box in the front of Fig. 2 
can be flipped), and both conformations were actually 
found in the simulations. Optimal parameters for the 
ground state search in this model are /3 = 1/kT « 2.0, 



app = qhh ~ 2, and uhp ~ —0.13. With these, average 
CPU times for finding E = -94 and E = -98 are ca. 20 
min and 80 hours, respectively [ p3[ . 

A surprising result is that the monomers are arranged 
in four homogeneous layers in Fig. 2, while they had 
formed only three layers in the putative ground state of 
Fig. 1 . Since the interaction should favor the segregation 
of different type monomers, one might have guessed that 
a conformation with a smaller number of layers should 




FIG. 1. Putative native state of the "four helix bundle" 
sequence, as proposed in Q|. It has E = —94, fits into a 
rectangular box, and consists of three homogeneous layers. 
Structurally, it can be interpreted as four helix bundles. 




FIG. 2. Conformation of the "four helix bundle" se- 
quence with E = —98. We propose that this is the actual 
ground state. Its shape is highly symmetric although it does 
not fit into a rectangular box. It is not degenerate except 
for a flipping of the central front 2x2x2 box. 



3 



be favored. We see that this is outweighed by the fact 
that both monomer types can form large double layers 
in the new conformation. Again, our new ground state is 
not 'compact' in the sense of minimizing the surface, and 
hence it disagrees with the wide spread prejudice that 
native states are compact. 



helix bundle" sequence has a small gap and has low de- 
generacy because of the modified interaction strengths. 
But it folds only at very low T, and should not be a good 
folder either. 

The authors thank G. Barkema, E. Domany, and M. 
Vendruscolo for discussions, and D.K. Klimov and R. Ra- 
makrishnan for correspondence. 




FIG. 3: Specific heat (heat capacity per monomer) of the 
80-mer "four helix bundle" vs T. 

We also constructed histograms of the energy distri- 
bution. Combining them with similar histograms ob- 
tained at higher temperatures we obtained average 
energies and heat capacities. The specific heat (Fig. 3) 
shows a large peak at T = 0.62 and two shoulders (at 
T « 0.45 and 1.0), all of which are statistically signif- 
icant. As shown by a more detailed analysis ||l6|, the 
shoulder at T = 1 is due to the collapse from open coil 
to molten globule, while that at T = 0.45 is due to the 
folding into the native state. There seems to be no state 
with E = —97, and very few states with E = —96 and 
—95, leading to an effective gap between E — —94 and 
E = —98. The main peak seems related to the formation 
of - mostly misfolded (helix-dominated) - secondary and 
tertiary structure. The low-temperature phase, however, 
contains mostly /3-sheets (see Fig. 2). 

In summary, we showed that the pruned-enriched 
Rosenbluth method can be very effectively applied to 
protein structure prediction in simple lattice models. It 
is suited for calculating statistical properties and is very 
successful in finding native states. In all cases it did bet- 
ter than any previous MC method, and in many cases it 
found lower states than those which had previously been 
conjectured to be native. Especially, we have presented 
a new candidate for the native conformation of a "four 
helix bundle" sequence which had been studied before by 
several authors. We verified that ground states of the HP 
model are highly degenerate and have no gap, leading to 
bad folders. In contrast, the ground state of the "four 
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